Fast Indexing and Visualization of Metric Data Sets using Slim-Trees

Authors

Caetano Traina

Agma J. M. Traina

Christos Faloutsos

Bernhard Seeger

Abstract

Many recent database applications must deal with similarity queries. For such applications, it is important to measure the similarity between two objects using the distance between them. Focusing on this problem, this paper proposes the Slim-tree, a new dynamic tree for organizing metric data sets in pages of fixed size. The Slim-tree uses the triangle inequality to prune the distance calculations needed to answer similarity queries over objects in metric spaces. The proposed insertion algorithm uses new policies to select the nodes where incoming objects are stored. When a node overflows, the Slim-tree uses a Minimal Spanning Tree to help with the split. The new insertion algorithm leads to a tree with high storage utilization and improved query performance. The Slim-tree is the first metric access method to tackle the problem of overlap between nodes in metric spaces and to propose a technique to minimize it. The proposed "fat-factor" is a way to quantify whether a given tree can be improved and also to compare two trees. We show how to use the fat-factor to achieve accurate estimates of the search performance and also how to improve the performance of a metric tree through the proposed "Slim-down" algorithm. This paper also presents a new tool for visualizing the Slim-tree. Visualization is a powerful tool for interactive data mining and for visually tracking the behavior of a tree under updates. Finally, we present a formula to estimate the number of disk accesses in range queries. Results from experiments with real and synthetic data sets show that the new algorithms of the Slim-tree lead to performance improvements. These results show that the Slim-tree outperforms the M-tree by up to 200 percent for range queries. The Minimal-Spanning-Tree-based split algorithm makes insertions up to 40 times faster. We observed improvements of up to 40 percent in range queries after applying the Slim-down algorithm.
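The triangle-inequality pruning mentioned in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function name `range_query_leaf`, the flat list of `(object, distance_to_representative)` entries, and the use of Euclidean distance are all assumptions made for illustration. The idea is that if each stored object keeps its precomputed distance to the node's representative, then for a query q with radius r, any object o with |d(q, rep) - d(rep, o)| > r cannot lie within r of q, so its distance to q never needs to be computed.

```python
from math import dist  # Euclidean distance (Python 3.8+); any metric would do


def range_query_leaf(rep, entries, query, radius, distance=dist):
    """Range query over one hypothetical leaf node.

    rep      -- the node's representative object
    entries  -- list of (obj, d_rep_obj) pairs, where d_rep_obj is the
                distance from rep to obj, precomputed at insertion time
    Returns (hits, calls): matching objects and the number of distance
    computations actually performed.
    """
    d_q_rep = distance(query, rep)  # computed once per node
    calls = 1
    hits = []
    for obj, d_rep_obj in entries:
        # Triangle inequality: d(q, obj) >= |d(q, rep) - d(rep, obj)|.
        # If that lower bound already exceeds the radius, skip obj
        # without computing d(q, obj).
        if abs(d_q_rep - d_rep_obj) > radius:
            continue
        calls += 1
        if distance(query, obj) <= radius:
            hits.append(obj)
    return hits, calls


# Illustrative data only (not taken from the paper's experiments).
rep = (0.0, 0.0)
objects = [(1.0, 0.0), (0.0, 2.0), (5.0, 5.0), (3.0, 4.0)]
entries = [(o, dist(rep, o)) for o in objects]

query, radius = (3.0, 3.0), 1.5
hits, calls = range_query_leaf(rep, entries, query, radius)
brute_force = [o for o in objects if dist(query, o) <= radius]
```

In this toy data set, three of the four objects are discarded by the lower bound alone, so only two distance computations are made instead of four, and the result matches a brute-force scan. The saving matters most when the distance function is expensive, which is the typical situation in metric spaces.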

Similar resources

An Efficient Indexing Method for Box Queries in NDDS Spaces using BoND-tree

Similarity searches in multidimensional Non-ordered Discrete Data Spaces (NDDS) are becoming increasingly important for application areas such as bioinformatics, biometrics, data mining and E-commerce. Efficient similarity searches require robust indexing techniques. Box queries (or window queries) are a type of query which specifies a set of allowed values in each dimension. Unfortunately, exi...

Exploring Intersection Trees for Indexing Metric Spaces

Searching in a dataset for objects that are similar to a given query object is a fundamental problem for several applications that use complex data. The general problem of many similarity measures for complex objects is their computational complexity, which makes them unusable for large databases. Here, we introduce a study of a variant of a metric tree data structure for indexing and querying ...

Slim-Trees: High Performance Metric Trees Minimizing Overlap Between Nodes

In this paper we present the Slim-tree, a dynamic tree for organizing metric datasets in pages of fixed size. The Slim-tree uses the "fat-factor" which provides a simple way to quantify the degree of overlap between the nodes in a metric tree. It is well-known that the degree of overlap directly affects the query performance of index structures. There are many suggestions to reduce overlap in m...

Efficient Querying on Genomic Databases by Using Metric Space Indexing Techniques

A genomic database consists of a set of nucleotide sequences, for which an important kind of query is local sequence alignment. This paper investigates two different indexing techniques, namely variations of GNAT trees [1] and M-trees [3], to support fast query evaluation for local alignment by transforming the alignment problem into a variant of the metric space neighborhood search problem.
